Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?
To answer this question, I looked at data from the National Longitudinal Survey of Youth, 1979 cohort. I began by cleaning the data, then created tables and bar charts to explore it.
#Load packages
library(kableExtra)
library(plyr)
library(ggplot2)
library(knitr)
library(MASS)
options(scipen=4)
#Retrieve data
nlsy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/final_project/nlsy79/nlsy79_income.csv", header=TRUE)
#Subset variables
nlsy.var <- c("R0000700", "R0001200", "R0214700", "R0214800", "R0217502", "T3977400", "R3401501")
#Recode variables
nlsy.fp <- nlsy[nlsy.var]
colnames(nlsy.fp) <- c("country.birth", "foreign.lang.spoken", "race", "sex", "fam.size", "income", "highest.grade")
str(nlsy.fp)
## 'data.frame': 12686 obs. of 7 variables:
## $ country.birth : int 1 2 1 1 1 1 1 1 1 1 ...
## $ foreign.lang.spoken: int 4 4 -4 -4 -4 -4 -4 4 -4 -4 ...
## $ race : int 3 3 3 3 3 3 3 3 3 3 ...
## $ sex : int 2 2 2 2 1 1 1 2 1 2 ...
## $ fam.size : int 5 5 5 5 4 4 3 3 6 3 ...
## $ income : int -5 19000 35000 -5 -5 105000 -5 40000 75000 -5 ...
## $ highest.grade : int -5 12 10 14 -5 16 12 12 14 9 ...
#Convert variables to factors and recode
nlsy.fp <- mutate(nlsy.fp,
country.birth = as.factor(mapvalues(country.birth,
c(1, 2, -3),
c("In the US", "In other country", NA))),
foreign.lang.spoken = as.factor(mapvalues(foreign.lang.spoken,
c(1, 2, 3, 4,-4, -3, -2),
c("Spanish", "French", "German", "Other", "No Foreign Language", "No Foreign Language", "No Foreign Language"))),
race = as.factor(mapvalues(race,
c(1, 2, 3),
c("Hispanic", "Black", "Non-Black, Non-Hispanic"))),
sex = as.factor(mapvalues(sex,
c(1, 2),
c("Male", "Female"))),
highest.grade = as.factor(mapvalues(highest.grade,
c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 95, -3, -5),
c("None", "1st Grade", "2nd Grade", "3rd Grade", "4th Grade", "5th Grade", "6th Grade","7th Grade", "8th Grade", "9th Grade", "10th Grade", "11th Grade", "12th Grade", "1st Year College", "2nd Year College", "3rd Year College", "4th Year College", "5th Year College", "6th Year College", "7th Year College", "8th Year College or More", NA, NA, NA))))
## The following `from` values were not present in `x`: 2, 95
#Reorder grades variable
nlsy.fp$highest.grade <- factor(nlsy.fp$highest.grade , levels = c("None", "1st Grade", "2nd Grade", "3rd Grade", "4th Grade", "5th Grade", "6th Grade","7th Grade", "8th Grade", "9th Grade", "10th Grade", "11th Grade", "12th Grade", "1st Year College", "2nd Year College", "3rd Year College", "4th Year College", "5th Year College", "6th Year College", "7th Year College", "8th Year College or More"))
#Set negative income values (survey missing-data codes) to NA
nlsy.fp$income[nlsy.fp$income < 0] <- NA
#Set top-coded income values to NA
nlsy.fp$income[nlsy.fp$income == 343830] <- NA
#Remove NAs in variables
nlsy.fp <- subset(nlsy.fp, !is.na(country.birth), select=country.birth:highest.grade) #Country of birth
nlsy.fp <- subset(nlsy.fp, !is.na(highest.grade), select=country.birth:highest.grade) #Highest grade completed
#Negative values of the income variable were set to NA because they are codes for invalid survey responses. The top-coded top 2 percent of incomes were also set to NA because they sat so far from the rest of the data that their presence distorted the graphs and charts.
#Summary of the clean data
summary(nlsy.fp)
## country.birth foreign.lang.spoken
## In other country: 687 French : 86
## In the US :9711 German : 71
## No Foreign Language:8221
## Other : 422
## Spanish :1598
##
##
## race sex fam.size
## Black :2714 Female:5305 Min. : 1.000
## Hispanic :1719 Male :5093 1st Qu.: 3.000
## Non-Black, Non-Hispanic:5965 Median : 4.000
## Mean : 4.552
## 3rd Qu.: 6.000
## Max. :15.000
##
## income highest.grade
## Min. : 0 12th Grade :4513
## 1st Qu.: 100 4th Year College:1241
## Median : 28800 2nd Year College: 889
## Mean : 35149 1st Year College: 873
## 3rd Qu.: 54000 11th Grade : 495
## Max. :178000 3rd Year College: 461
## NA's :3807 (Other) :1926
#Average income by sex
in.sex<- ddply(nlsy.fp, ~ sex, summarize,
income.sex = mean(income, na.rm = TRUE))
kable(in.sex, format = "html")%>%
kable_styling(bootstrap_options = c("striped", "hover"))
| sex | income.sex |
|---|---|
| Female | 28731.39 |
| Male | 42323.52 |
From an initial view of average income by sex, Males earned 42323.52 dollars on average and Females earned 28731.39 dollars, a difference of 13592.14 dollars.
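The gap can be checked directly from the two group means in the table. The short sketch below uses the means as printed (rounded to cents), so its result can differ from the figure in the text by a cent; the variable names are illustrative and not part of the original analysis.

```r
# Group means copied from the table above (rounded to cents)
mean.male <- 42323.52
mean.female <- 28731.39

# Absolute gap in dollars, and Female income as a share of Male income
gap <- mean.male - mean.female
ratio <- mean.female / mean.male

round(gap, 2)
round(ratio, 3)
```

Reporting the ratio next to the dollar gap makes it easier to compare groups whose overall income levels differ.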
#Table of the number of respondents broken down by gender and race
simpl.tbl <- addmargins(with(nlsy.fp, table(as.array(sex), as.array(race))))
kable(simpl.tbl, format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | Black | Hispanic | Non-Black, Non-Hispanic | Sum |
|---|---|---|---|---|
| Female | 1352 | 868 | 3085 | 5305 |
| Male | 1362 | 851 | 2880 | 5093 |
| Sum | 2714 | 1719 | 5965 | 10398 |
I also wanted to analyze whether a difference of income existed among races. But first, I wanted to know how many respondents were within each category. Of the 10398 respondents of the survey, there were 5093 Male respondents and 5305 Female respondents. Among the Males there were 1362 Black respondents, 851 Hispanic respondents, and 2880 Non-Black, Non-Hispanic respondents. Among the Females there were 1352 Black respondents, 868 Hispanic respondents, and 3085 Non-Black, Non-Hispanic respondents.
#Average income by race
in.race<- ddply(nlsy.fp, ~ race, summarize,
income.race = mean(income, na.rm= TRUE))
kable(in.race, format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| race | income.race |
|---|---|
| Black | 26915.25 |
| Hispanic | 32768.26 |
| Non-Black, Non-Hispanic | 41241.44 |
#Create a bar chart
ggplot(data = in.race,
aes(x = race,
y = income.race)) +
geom_bar(stat = "identity") +
xlab("") +
ylab("Average Income") +
ggtitle("Average Income by Race")
After looking at how the respondents are distributed by gender and race, I calculated a table of average income by race. Non-Black, Non-Hispanic respondents earned the most on average, with 41241.44 dollars. Hispanic respondents earned the second most with 32768.26 dollars, and Black respondents earned 26915.25 dollars.
#Average income by sex and race table
income.race.sex<- with(nlsy.fp, round(tapply(income, INDEX = list(race, sex), FUN = mean, na.rm= TRUE)))
kable(income.race.sex, format = "html")%>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | Female | Male |
|---|---|---|
| Black | 24161 | 29899 |
| Hispanic | 27422 | 38708 |
| Non-Black, Non-Hispanic | 32030 | 51770 |
Breaking the data down by gender shows that Males in each of the three races earned more than Females. Non-Black, Non-Hispanic Males earned the most of the 6 groups with 51770 dollars, compared to 32030 dollars for Non-Black, Non-Hispanic Females. Hispanic Males earned 38708 dollars compared to 27422 dollars for Hispanic Females. Black Males earned 29899 dollars compared to 24161 dollars for Black Females.
#Create a dataframe
income.race.sex.dif <- ddply(nlsy.fp, ~ race, summarize,
in.gap = mean(income[sex == "Male"], na.rm = TRUE) - mean(income[sex == "Female"], na.rm = TRUE))
#Plot data frame
income.race.sex.dif.plot <- ggplot(data = income.race.sex.dif, aes(x = race, y = in.gap, fill = race)) +
geom_bar(stat = "identity") +
xlab("") +
ylab("Difference of Income") +
ggtitle("Difference of Income between Men and Women") +
guides(fill = "none")
income.race.sex.dif.plot
To get a better understanding of how the gender gap differs between the races, a bar chart of the difference of average income is plotted. The higher the bar, the larger the difference. The difference in average income is largest among Non-Black, Non-Hispanic respondents.
# Create a data frame
# Note: size counts every respondent in the group, including those with missing income,
# and lower/upper are the mean minus/plus one standard deviation
base.df<- ddply(nlsy.fp, ~ race + sex, summarize,
mean.income = mean(income, na.rm= TRUE),
size= length(income),
standard.deviation= sd(income, na.rm=TRUE),
standard.error.mean.income= round(standard.deviation/ sqrt(size),2),
lower = mean.income - standard.deviation,
upper= mean.income + standard.deviation)
kable(base.df, format = "html")%>%
kable_styling(bootstrap_options = c("striped", "hover"))
| race | sex | mean.income | size | standard.deviation | standard.error.mean.income | lower | upper |
|---|---|---|---|---|---|---|---|
| Black | Female | 24161.04 | 1352 | 26882.23 | 731.10 | -2721.1861 | 51043.27 |
| Black | Male | 29898.98 | 1362 | 33231.11 | 900.44 | -3332.1266 | 63130.09 |
| Hispanic | Female | 27422.25 | 868 | 29441.79 | 999.32 | -2019.5415 | 56864.05 |
| Hispanic | Male | 38708.27 | 851 | 35973.71 | 1233.16 | 2734.5639 | 74681.98 |
| Non-Black, Non-Hispanic | Female | 32029.78 | 3085 | 32502.83 | 585.19 | -473.0494 | 64532.62 |
| Non-Black, Non-Hispanic | Male | 51769.92 | 2880 | 40176.26 | 748.64 | 11593.6511 | 91946.18 |
The standard deviations also show that, within every race, income varies more among Males than among Females. For example, the standard deviation of income for Black Males is 33231.11 dollars, compared with 26882.23 dollars for Black Females, so the incomes of Black Males are more spread out around their mean. The standard error of the mean is smaller for Females than for Males in each of the 3 races; for example, it is 1233.16 dollars for Hispanic Males compared with 999.32 dollars for Hispanic Females. The smaller the standard error, the more precisely the sample mean estimates the true population mean. On its own, the standard error does not establish a significant difference between genders.
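Because `size` in the table counts every respondent in a group, including those whose income is missing, one alternative is to compute the standard error from the number of non-missing incomes and to build a confidence interval from it. The following is a minimal sketch under that assumption; `mean.ci` and `toy.income` are hypothetical names with made-up values, not part of the original analysis.

```r
# Hypothetical helper: mean, standard error, and t-based confidence
# interval computed from the non-missing values only
mean.ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)
  crit <- qt(1 - (1 - level) / 2, df = n - 1)
  c(mean = mean(x), n = n, se = se,
    lower = mean(x) - crit * se,
    upper = mean(x) + crit * se)
}

# Toy incomes with missing values (made up for illustration)
toy.income <- c(12000, 30000, NA, 45000, 0, NA, 61000)
mean.ci(toy.income)
```

In a `ddply` summary, the same idea would amount to using `sum(!is.na(income))` in place of `length(income)`.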
The variables I chose to answer this question were:
- Whether the respondent was born in the United States: I wanted to use this question to help determine whether being a U.S. citizen had a positive effect on income. It is easier for U.S. citizens to get jobs than for non-U.S. citizens, who may need sponsorship.
- Whether a foreign language was spoken at home during the respondent’s childhood: Does knowing another language, or growing up around one, improve one’s chances of earning a higher income?
- Respondent’s race/ethnicity: Race is going to be an important factor in the difference of income between the sexes, given years of racial segregation and of minorities being kept from advancing to better employment opportunities.
- Sex of respondent: Is there a difference of income between sexes?
- Family size: Does having a bigger family improve income?
- Income: Is there a difference of income between sexes?
- Highest grade completed: Does income increase with the number of years of education?
Within the income variable, the top 2% of incomes were “top coded”, which means that we do not see the actual incomes of the top 2% of earners. For those respondents, the income variable holds the average income of the top 2% of earners instead. After exploring the bar charts and tables, I removed the top-coded values because they were so high that they distorted the charts. They also behaved like outliers, and it was not clear that they would contribute to the study.
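One way to see how much of the data is affected before dropping it is to flag the top-coded rows rather than overwrite them straight away. The sketch below uses a small made-up vector; it assumes, as the cleaning code does, that all top-coded incomes share the single largest value (343830 in this data set).

```r
# Toy income vector (made up), with -5 standing in for a missing-data code
income <- c(19000, 35000, 343830, 105000, 343830, -5)
income[income < 0] <- NA

# Assumption: the top-coded value is the largest value present
topcode <- max(income, na.rm = TRUE)
is.topcoded <- !is.na(income) & income == topcode

# Count and share of valid responses that are top coded
sum(is.topcoded)
mean(is.topcoded[!is.na(income)])

# Set the flagged values to NA, as in the cleaning step
income.clean <- ifelse(is.topcoded, NA, income)
```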
After careful analysis, I also removed the negative values from my data. I did recode some of the negative values based on the data dictionary, such as recoding the foreign language question to ‘No Foreign Language’. The remaining negative values were judged to be few enough that removing them would not have affected the rest of the study.
There were 212 more Female respondents than Male respondents, but out of the 10398 total respondents the distribution seemed balanced. Broken down by race, the data seem to reflect the current population, with more Non Black, Non Hispanic respondents, followed by Black respondents and then Hispanic respondents.
The graph titled “Difference of Income between Men and Women” was interesting to see because I expected a difference, but not one as large as the gap between Non Black, Non Hispanic Males and Females.
I was also surprised to see that in each of the 3 races, there was a noticeable difference of income between Males and Females. I had expected very little difference, if any, among Black and Hispanic respondents.
The next section will go into the main findings and how these variables impact the income between Males and Females.
#plot data
base.df.plot <- ggplot(base.df,
aes(x= sex,
y= mean.income,
fill= race)
) +
geom_col(position= "dodge") +
geom_errorbar(aes(ymin = lower,
ymax = upper),
width = .2, position= position_dodge(0.9)) +
xlab("") +
ylab("Average Income") +
ggtitle("Average Income Distribution by Gender Among Race")
base.df.plot
The error bars span one standard deviation on either side of each group's mean income, so they describe the spread of incomes within the group. For example, the bar for Non Black, Non Hispanic Males runs from 11593.65 to 91946.18 dollars. They are not 95 percent confidence intervals for the mean, which would be built from the standard error and would be much narrower.
# qq plot
with(nlsy.fp, qqnorm(income[sex=="Male"], main = "Normal Q-Q Plot for Males"))
# add reference line
with(nlsy.fp, qqline(income[sex=="Male"], col = "red"))
To assess whether the observed data follow a normal distribution, a Q-Q plot was drawn. Among Males, the data are not normal. The upward curve at the top suggests a strong positive skew, and the flat section at the bottom suggests that there are many low values.
# qq plot
with(nlsy.fp, qqnorm(income[sex=="Female"], main = "Normal Q-Q Plot for Females"))
# add reference line
with(nlsy.fp, qqline(income[sex=="Female"], col = "red"))
The Q-Q plot for Females shows the same pattern: the data are not normal. The top values depart further from the reference line than they do for Males, which also suggests a strong positive skew.
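The skew that the Q-Q plots suggest can also be summarized with a single number. The sketch below computes the moment-based sample skewness in base R; `skewness.simple` is a hypothetical helper and the incomes are made-up values, so it only illustrates the idea.

```r
# Hypothetical helper: moment-based skewness, ignoring missing values.
# Positive values indicate a long right tail; 0 indicates symmetry.
skewness.simple <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy incomes (made up): many low values and one very high value
toy.income <- c(0, 0, 100, 20000, 30000, 45000, NA, 178000)
skewness.simple(toy.income)

# A symmetric vector for comparison
skewness.simple(c(1, 2, 3))
```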
Sex T.test
income.sex.t.test <- t.test(income ~ sex, data = nlsy.fp)
income.sex.t.test
##
## Welch Two Sample t-test
##
## data: income by sex
## t = -15.754, df = 5911.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -15283.48 -11900.80
## sample estimates:
## mean in group Female mean in group Male
## 28731.39 42323.52
options(scipen=4)
To test whether average income differs between the two groups, a Welch two-sample t test was done. The p-value for this t test is less than 2.2e-16 and the t statistic is -15.75, so the null hypothesis of no difference in mean income by gender is rejected. This t test suggests that Males do have higher income. The 95% confidence interval for the difference in means (Female minus Male) runs from -15283.48 to -11900.8, which suggests that, if this sample is representative of the larger population, we can be 95% confident that Males earn between 11900.8 and 15283.48 dollars more than Females on average. The mean income for Females is 28731.39 and the mean income for Males is 42323.52.
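`t.test()` returns an object whose fields can be read directly, which avoids reporting a very small p-value as exactly 0. The sketch below runs on simulated incomes (the means and spreads are arbitrary assumptions), not on the NLSY data.

```r
# Simulated data for illustration only
set.seed(1)
toy <- data.frame(
  sex = rep(c("Female", "Male"), each = 50),
  income = c(rnorm(50, mean = 28000, sd = 5000),
             rnorm(50, mean = 42000, sd = 8000))
)

# Welch two-sample t test, as in the analysis above
toy.t.test <- t.test(income ~ sex, data = toy)

# Group means, and the confidence interval for Female minus Male
toy.t.test$estimate
toy.t.test$conf.int

# Report tiny p-values as an upper bound instead of 0
format.pval(toy.t.test$p.value, eps = 1e-4)
```

The interval is for the first group minus the second, in factor-level order, which is why it is negative when Males earn more.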
The next three t tests take a look based on race:
Non-Black, Non-Hispanic t.test
income.sex.t.test2 <- t.test(income ~ sex, data = subset(nlsy.fp, race == "Non-Black, Non-Hispanic"))
income.sex.t.test2
##
## Welch Two Sample t-test
##
## data: income by sex
## t = -15.3, df = 2928.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -22269.87 -17210.39
## sample estimates:
## mean in group Female mean in group Male
## 32029.78 51769.92
Among Non-Black, Non-Hispanic respondents, the p-value for the t test is less than 2.2e-16 and the t statistic is -15.3, so the null hypothesis of no difference in mean income by gender is rejected. This t test suggests that Non-Black, Non-Hispanic Males do have higher income. The 95% confidence interval for the difference in means (Female minus Male) runs from -22269.87 to -17210.39, which suggests that, if this sample is representative of the larger population, we can be 95% confident that the gender difference of income lies between those values. The mean income for Non-Black, Non-Hispanic Females is 32029.78 and the mean income for Non-Black, Non-Hispanic Males is 51769.92.
Hispanics t.test
income.sex.t.test3 <- t.test(income ~ sex, data = subset(nlsy.fp, race == "Hispanic"))
income.sex.t.test3
##
## Welch Two Sample t-test
##
## data: income by sex
## t = -6.0851, df = 1165.5, p-value = 1.577e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14924.923 -7647.113
## sample estimates:
## mean in group Female mean in group Male
## 27422.25 38708.27
Among Hispanic respondents, the p-value for the t test is 1.577e-09 and the t statistic is -6.09, so the null hypothesis of no difference in mean income by gender is rejected. This t test suggests that Hispanic Males do have higher income. The 95% confidence interval for the difference in means (Female minus Male) runs from -14924.92 to -7647.11, which suggests that, if this sample is representative of the larger population, we can be 95% confident that the gender difference of income lies between those values. The mean income for Hispanic Females is 27422.25 and the mean income for Hispanic Males is 38708.27.
Blacks t.test
income.sex.t.test4 <- t.test(income ~ sex, data = subset(nlsy.fp, race == "Black"))
income.sex.t.test4
##
## Welch Two Sample t-test
##
## data: income by sex
## t = -4.2766, df = 1892, p-value = 0.00001992
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8369.317 -3106.560
## sample estimates:
## mean in group Female mean in group Male
## 24161.04 29898.98
Among Black respondents, the p-value for the t test is 0.00001992 and the t statistic is -4.28, so the null hypothesis of no difference in mean income by gender is rejected. This t test suggests that Black Males do have higher income. The 95% confidence interval for the difference in means (Female minus Male) runs from -8369.32 to -3106.56, which suggests that, if this sample is representative of the larger population, we can be 95% confident that the gender difference of income lies between those values. The mean income for Black Females is 24161.04 and the mean income for Black Males is 29898.98.
gender.income.boxplot <- ggplot( data = nlsy.fp,
aes(x = sex,
y = income, na.rm = T,
fill = race)) +
geom_boxplot() +
xlab("") +
ylab("Income") +
ggtitle("Income Distribution by Gender Among Race")
gender.income.boxplot
## Warning: Removed 3807 rows containing non-finite values (stat_boxplot).
This box plot gives a better picture of how income is distributed within each gender and race. Note that the line inside each box marks the median, not the mean; the averages quoted here come from the earlier table of average income by sex and race. The average incomes among Females lie relatively close to each other, with Black Females earning 24161 dollars and Non-Black, Non-Hispanic Females earning the most with 32030 dollars. There are larger gaps between average incomes among Males. The average income of Black Males is the lowest among Males at 29899 dollars, at about the same level as Hispanic Females at 27422 dollars. The average income for Non Black, Non Hispanic Males is the largest at 51770 dollars and is above the third quartile for Black Females. There are some outliers in each box plot, and most of them are above the $150,000 range.
#create a dataframe
sex.country.income<- ddply(nlsy.fp, ~ country.birth + sex, summarize, mean.income = mean(income, na.rm= TRUE))
#subset dataframe
sex.country.income<- subset(sex.country.income, !is.na(country.birth), select= c(country.birth, sex, mean.income))
#plot dataframe
sex.country.income.plot <- ggplot(data = sex.country.income,
aes(x = country.birth,
y = mean.income,
fill = sex)) +
geom_col(position= "dodge") +
xlab("") +
ylab("Average Income") +
ggtitle("Average Income by Country of Birth Among Gender")+
scale_fill_brewer(palette = "Spectral")
sex.country.income.plot
The bar graph shows that Males continue to earn more than Females, even when the data is broken up by country of birth. It is also interesting to note that Females not born in the U.S. earn, on average, 184.67 dollars more than Females born in the U.S. The difference is larger for Males. On average, Males born in another country earn 2686.26 dollars more than Males born in the U.S.
sex.language.income <- ddply(nlsy.fp, ~ foreign.lang.spoken + sex, summarize, mean.income = mean(income, na.rm= TRUE))
sex.language.income.plot <- ggplot(data = sex.language.income,
aes(x = foreign.lang.spoken,
y = mean.income,
fill = sex)) +
geom_col(position= "dodge") +
xlab("Foreign Language Spoken at Home") +
ylab("Average Income") +
ggtitle("Average Income by Foreign Language Spoken at Home Among Gender") +
scale_fill_brewer(palette = "Spectral")
sex.language.income.plot
The same trend continues when the difference of income between genders is broken down by foreign language spoken at home. Respondents with no foreign language spoken at home, which suggests an English-only household, show the same gap we have been noticing. The biggest difference between genders is in the ‘Other’ language group, with a difference of 21326.28 dollars, followed by the German group, with a difference of 20188.53 dollars.
ggplot(nlsy.fp, aes(x= foreign.lang.spoken,
y= income,
color= sex)) +
geom_jitter(alpha = .25) +
ylab( "Income") +
xlab("Foreign Language Spoken at Home") +
ggtitle("Income by Foreign Language Spoken at Home")
## Warning: Removed 3807 rows containing missing values (geom_point).
Using the jitter function, we can see how many respondents fall under each language. There are far more points under ‘No Foreign Language’ than under any other category, with Spanish second. The German and Other categories, which showed the largest gaps above, contain far fewer observations than ‘No Foreign Language’ and Spanish. Within this graphic, Males continue to earn more than Females.
#Average income by sex and grade
sex.grade.income <- ddply(nlsy.fp, ~ sex + highest.grade, summarize,
mean.income = round(mean(income, na.rm = TRUE),digits = 2))
#Reorder Grades
sex.grade.income$highest.grade <- factor(sex.grade.income$highest.grade , levels = c("None", "1st Grade", "2nd Grade", "3rd Grade", "4th Grade", "5th Grade", "6th Grade","7th Grade", "8th Grade", "9th Grade", "10th Grade", "11th Grade", "12th Grade", "1st Year College", "2nd Year College", "3rd Year College", "4th Year College", "5th Year College", "6th Year College", "7th Year College", "8th Year College or More"))
sex.grade.income.plot <- ggplot(data = sex.grade.income,
aes(x = highest.grade,
y = mean.income,
fill = sex)) +
geom_bar(stat = "identity", position= "dodge") +
xlab("Highest Grade Completed") +
ylab("Average Income") +
ggtitle("Average Income by Highest Grade Completed Among Gender") +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scale_fill_brewer(palette = "Spectral")
sex.grade.income.plot
## Warning: Removed 2 rows containing missing values (geom_bar).
Looking at education, average income is positively associated with years of education: it appears that the more years of education one has, the more income one earns. Still, the chart shows that Males, on average, earn more than Females at every year of education, although the gap seems to shrink when Females have 7 or more years of college education. There are two major jumps for both genders: (1) after receiving 12 years of education (graduating with a high school diploma) and (2) after receiving 4 years of college education (graduating with a Bachelor’s). The disparity between the sexes seems to widen after graduating high school. It is interesting to note that there is an outlier for the 3rd grade.
#Scatter plot
#Factor codes start at 1 for "None", so subtract 1 to plot years of schooling
ggplot(nlsy.fp, aes(x=as.numeric(highest.grade) - 1,
y=income,
color=sex)) +
geom_jitter(alpha = .25) +
geom_smooth()+
ylab( "Income") +
xlab("Highest Grade Completed") +
ggtitle("Income by Highest Grade Completed")
## `geom_smooth()` using method = 'gam'
## Warning: Removed 3807 rows containing non-finite values (stat_smooth).
## Warning: Removed 3807 rows containing missing values (geom_point).
This chart shows the same information but looks at the total values among Males and Females. Years 12 and 16, high school diploma and Bachelor’s degree, respectively are clearly shown. You can also see the trend increasing for both Males and Females with the curvature of the lines.
final.regression <- lm(income ~ ., data= nlsy.fp)
options(scipen=15)
kable(summary(final.regression)$coef, digits = c(3, 3, 3, 4))
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -1509.469 | 22378.191 | -0.067 | 0.9462 |
| country.birthIn the US | -5039.243 | 1735.719 | -2.903 | 0.0037 |
| foreign.lang.spokenGerman | -6422.786 | 6416.237 | -1.001 | 0.3169 |
| foreign.lang.spokenNo Foreign Language | -2323.848 | 4389.581 | -0.529 | 0.5965 |
| foreign.lang.spokenOther | -1844.781 | 4914.013 | -0.375 | 0.7074 |
| foreign.lang.spokenSpanish | -7535.432 | 4910.354 | -1.535 | 0.1249 |
| raceHispanic | 11841.049 | 2477.168 | 4.780 | 0.0000 |
| raceNon-Black, Non-Hispanic | 10206.605 | 907.523 | 11.247 | 0.0000 |
| sexMale | 15387.589 | 766.079 | 20.086 | 0.0000 |
| fam.size | -24.046 | 174.238 | -0.138 | 0.8902 |
| highest.grade1st Grade | -6466.938 | 37859.896 | -0.171 | 0.8644 |
| highest.grade3rd Grade | 12988.142 | 23953.721 | 0.542 | 0.5877 |
| highest.grade4th Grade | -1838.142 | 25857.062 | -0.071 | 0.9433 |
| highest.grade5th Grade | 2517.396 | 25234.641 | 0.100 | 0.9205 |
| highest.grade6th Grade | 4530.280 | 22495.280 | 0.201 | 0.8404 |
| highest.grade7th Grade | 4354.194 | 22299.914 | 0.195 | 0.8452 |
| highest.grade8th Grade | 5117.236 | 22039.643 | 0.232 | 0.8164 |
| highest.grade9th Grade | 8599.148 | 21961.984 | 0.392 | 0.6954 |
| highest.grade10th Grade | 11835.556 | 21949.801 | 0.539 | 0.5898 |
| highest.grade11th Grade | 13190.014 | 21937.622 | 0.601 | 0.5477 |
| highest.grade12th Grade | 24564.203 | 21879.699 | 1.123 | 0.2616 |
| highest.grade1st Year College | 33141.560 | 21909.301 | 1.513 | 0.1304 |
| highest.grade2nd Year College | 37319.086 | 21910.684 | 1.703 | 0.0886 |
| highest.grade3rd Year College | 40066.387 | 21943.567 | 1.826 | 0.0679 |
| highest.grade4th Year College | 53515.006 | 21906.261 | 2.443 | 0.0146 |
| highest.grade5th Year College | 56257.657 | 22014.683 | 2.555 | 0.0106 |
| highest.grade6th Year College | 56316.646 | 22072.948 | 2.551 | 0.0108 |
| highest.grade7th Year College | 73701.571 | 22337.726 | 3.299 | 0.0010 |
| highest.grade8th Year College or More | 90254.902 | 22618.425 | 3.990 | 0.0001 |
final.regression.coef <- round(summary(final.regression)$coef, 4)
class(final.regression.coef)
## [1] "matrix"
attributes(final.regression.coef)
## $dim
## [1] 29 4
## $dimnames
## $dimnames[[1]]
## [1] "(Intercept)"
## [2] "country.birthIn the US"
## [3] "foreign.lang.spokenGerman"
## [4] "foreign.lang.spokenNo Foreign Language"
## [5] "foreign.lang.spokenOther"
## [6] "foreign.lang.spokenSpanish"
## [7] "raceHispanic"
## [8] "raceNon-Black, Non-Hispanic"
## [9] "sexMale"
## [10] "fam.size"
## [11] "highest.grade1st Grade"
## [12] "highest.grade3rd Grade"
## [13] "highest.grade4th Grade"
## [14] "highest.grade5th Grade"
## [15] "highest.grade6th Grade"
## [16] "highest.grade7th Grade"
## [17] "highest.grade8th Grade"
## [18] "highest.grade9th Grade"
## [19] "highest.grade10th Grade"
## [20] "highest.grade11th Grade"
## [21] "highest.grade12th Grade"
## [22] "highest.grade1st Year College"
## [23] "highest.grade2nd Year College"
## [24] "highest.grade3rd Year College"
## [25] "highest.grade4th Year College"
## [26] "highest.grade5th Year College"
## [27] "highest.grade6th Year College"
## [28] "highest.grade7th Year College"
## [29] "highest.grade8th Year College or More"
## $dimnames[[2]]
## [1] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
Interpretation
Several variables are statistically significant predictors of income at the 0.05 level: being born in the U.S. (p = 0.0037), being Hispanic (p < 0.0001), being Non-Black, Non-Hispanic (p < 0.0001), being Male (p < 0.0001), and having completed 4 years of college (p = 0.0146), 5 years (p = 0.0106), 6 years (p = 0.0108), 7 years (p = 0.001), or 8 or more years (p = 0.0001). Completing 2 years of college (p = 0.0886) or 3 years of college (p = 0.0679) is significant only at the 0.10 level. All of these coefficients, with the exception of country.birthIn the US, are positive, suggesting a positive relationship with income.
The baseline for this regression is Black Females who were not born in the U.S., who were raised with French spoken at home, and who completed no schooling (highest grade “None”). The interpretations are made relative to this baseline, holding the other variables constant. Education is the largest driver of income, followed by sex, race, and country of birth.
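The baseline comes from how R orders factor levels: with R's default treatment contrasts, `lm()` uses the first level of each factor as the reference category, and levels built from character values are sorted alphabetically unless set explicitly. The small sketch below, on a made-up vector, shows how to check the reference level and how `relevel()` could pick a different one; `race.toy` is a hypothetical name.

```r
# Made-up factor for illustration; levels are sorted alphabetically
race.toy <- factor(c("Hispanic", "Black", "Non-Black, Non-Hispanic", "Black"))

# The first level is the reference category in a regression
levels(race.toy)[1]

# Choose a different reference category without changing the data
race.releveled <- relevel(race.toy, ref = "Non-Black, Non-Hispanic")
levels(race.releveled)[1]
```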
8 years or more of college education: The coefficient is 90254.9, which means that respondents with 8 or more years of college education earn, on average, about 90254.9 dollars more than otherwise similar respondents with no schooling.
7 years of college education: The coefficient is 73701.57, which means that respondents with 7 years of college education earn, on average, about 73701.57 dollars more than otherwise similar respondents with no schooling.
6 years of college education: The coefficient is 56316.65, which means that respondents with 6 years of college education earn, on average, about 56316.65 dollars more than otherwise similar respondents with no schooling.
5 years of college education: The coefficient is 56257.66, which means that respondents with 5 years of college education earn, on average, about 56257.66 dollars more than otherwise similar respondents with no schooling.
4 years of college education: The coefficient is 53515.01, which means that respondents with 4 years of college education earn, on average, about 53515.01 dollars more than otherwise similar respondents with no schooling.
3 years of college education: The coefficient is 40066.39, which means that respondents with 3 years of college education earn, on average, about 40066.39 dollars more than otherwise similar respondents with no schooling.
2 years of college education: The coefficient is 37319.09, which means that respondents with 2 years of college education earn, on average, about 37319.09 dollars more than otherwise similar respondents with no schooling.
Male Sex: The coefficient is 15387.59, which means that Males earn, on average, about 15387.59 dollars more than otherwise similar Females.
Non Black, Non Hispanic: The coefficient is 10206.6, which means that Non Black, Non Hispanics earn, on average, about 10206.6 dollars more than otherwise similar Black respondents.
Hispanic: The coefficient is 11841.05, which means that Hispanics earn, on average, about 11841.05 dollars more than otherwise similar Black respondents.
Born in the U.S.: The coefficient is -5039.24, which means that those born in the U.S. earn, on average, about 5039.24 dollars less than otherwise similar respondents born in another country.
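Because the model is additive, a predicted income is the intercept plus every coefficient that applies to a given respondent. As a worked illustration, the sketch below adds up the estimates from the regression table for a hypothetical respondent: a U.S.-born, Non-Black, Non-Hispanic Male with no foreign language spoken at home, a family size of 4, and 4 years of college. The profile and the variable names are assumptions chosen for illustration.

```r
# Estimates copied from the regression table above
coefs <- c(
  intercept        = -1509.469,
  born.in.us       = -5039.243,
  no.foreign.lang  = -2323.848,
  non.black.non.hi = 10206.605,
  male             = 15387.589,
  fam.size         = -24.046,
  college.4th.year = 53515.006
)

# Hypothetical family size for the worked example
family.size <- 4

# Sum the terms that apply; fam.size is multiplied by its value
predicted.income <- coefs[["intercept"]] +
  coefs[["born.in.us"]] +
  coefs[["no.foreign.lang"]] +
  coefs[["non.black.non.hi"]] +
  coefs[["male"]] +
  coefs[["fam.size"]] * family.size +
  coefs[["college.4th.year"]]

round(predicted.income, 2)
```

With the fitted model object available, `predict(final.regression, newdata = ...)` would do the same calculation without copying coefficients by hand.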
#interaction
regression.interact <- lm(income ~ sex * race, data = nlsy.fp)
kable(summary(regression.interact)$coef, digits = c(3, 3, 3, 4))
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 24161.042 | 1034.641 | 23.352 | 0.0000 |
| sexMale | 5737.938 | 1493.375 | 3.842 | 0.0001 |
| raceHispanic | 3261.210 | 1665.432 | 1.958 | 0.0503 |
| raceNon-Black, Non-Hispanic | 7868.743 | 1313.460 | 5.991 | 0.0000 |
| sexMale:raceHispanic | 5548.080 | 2413.665 | 2.299 | 0.0216 |
| sexMale:raceNon-Black, Non-Hispanic | 14002.193 | 1906.082 | 7.346 | 0.0000 |
regression.interact.coef <- round(summary(regression.interact)$coef, 4)
class(regression.interact.coef)
## [1] "matrix"
attributes(regression.interact.coef)
## $dim
## [1] 6 4
## $dimnames
## $dimnames[[1]]
## [1] "(Intercept)"
## [2] "sexMale"
## [3] "raceHispanic"
## [4] "raceNon-Black, Non-Hispanic"
## [5] "sexMale:raceHispanic"
## [6] "sexMale:raceNon-Black, Non-Hispanic"
## $dimnames[[2]]
## [1] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
More Interpretation
To look at how the gender gap varies by race, a regression with an interaction between sex and race was fit. All of the coefficients are statistically significant at the 0.05 level except raceHispanic, which is borderline (p = 0.0503). On average, Black Males earned 5737.94 dollars more than Black Females. Hispanic Males earned 11286.02 dollars more than Hispanic Females (5737.94 + 5548.08), and Non Black, Non Hispanic Males earned 19740.13 dollars more than Non Black, Non Hispanic Females (5737.94 + 14002.19).
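In a model with a sex-by-race interaction, the gender gap within a race is the `sexMale` coefficient plus that race's interaction term, and each group mean is rebuilt by adding every coefficient that applies. The sketch below does this arithmetic with the estimates from the interaction table; the vector and variable names are illustrative.

```r
# Estimates copied from the interaction regression table above
b <- c(
  intercept     = 24161.042,  # Black Females (reference group)
  male          = 5737.938,
  hispanic      = 3261.210,
  nbnh          = 7868.743,   # Non-Black, Non-Hispanic
  male.hispanic = 5548.080,
  male.nbnh     = 14002.193
)

# Male minus Female gap within each race
gap.black    <- b[["male"]]
gap.hispanic <- b[["male"]] + b[["male.hispanic"]]
gap.nbnh     <- b[["male"]] + b[["male.nbnh"]]

# Group mean for Hispanic Males, rebuilt from the coefficients
mean.hispanic.male <- b[["intercept"]] + b[["male"]] +
  b[["hispanic"]] + b[["male.hispanic"]]

round(c(gap.black, gap.hispanic, gap.nbnh, mean.hispanic.male), 2)
```

The rebuilt mean for Hispanic Males agrees with the 38708.27 dollars reported in the earlier table of group means, which is a useful check that the coefficients are being combined correctly.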
In summary, Males do earn more than Females. The difference varies once it is broken up by race: it is largest for Non-Black, Non-Hispanics, followed by Hispanics, and then Blacks. The disparity still exists when the data is broken up by education. Although income rises with education, the gap in income remains. The gap also remains when the data is broken up by country of birth, although being born in the U.S. appears to have a negative effect on income compared with being born outside of the country. Family size was not a statistically significant predictor of income. There may be potential confounders, such as the type of industry one works in or the length of one’s employment. I have confidence in my conclusion, but further analysis would have to be done to investigate those confounders before presenting the findings to policymakers.